feat: gate pilot — LLM at batch decision boundaries#1
Draft
logpie wants to merge 2 commits into
Adds a gate pilot that replaces the simple replan() call after batch failures. The pilot reads disk artifacts (verify logs, QA verdicts, task summaries, learnings) and returns structured decisions: failure analysis, retry strategies, routed context for upcoming tasks, skip recommendations, and re-batching.

Key design:
- Stateless: reconstructs context from files each invocation
- No telephone game: pilot makes system-level decisions, coding agents interpret their own errors directly
- Structured JSON output, orchestrator validates and applies
- Same model as planner (configurable via planner_model)
- Falls back to replan() on parse failure
- Config flag: pilot: false in otto.yaml to disable
- Zero overhead when no failures (pilot only invoked at batch boundary with failures + remaining tasks)

Codex-reviewed: 3 rounds, all CRITICAL/IMPORTANT findings fixed, APPROVED.

Benchmark: 53 tasks across 18 runs, 0 regressions, 0 pilot overhead. Pilot not yet validated on real failures; shipping as safe no-op upgrade for i2p readiness. Will prove value at scale (5+ tasks, multiple batches).

New files:
- otto/pilot.py: context assembly, LLM invocation, decision parsing
- tests/test_pilot.py: 22 unit tests
- tests/test_pilot_benchmark.py: 6 scenario benchmark tests
- bench/pilot-benchmark.sh: A/B benchmark runner
- bench/pressure/projects/pilot-test-*: 3 synthetic test projects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
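The "structured JSON output, orchestrator validates and applies, falls back to replan() on parse failure" flow can be sketched roughly as below. This is only an illustration: `PilotDecision`, `parse_pilot_decision`, `decide`, and the `retry`/`skip` field names are hypothetical and are not taken from otto/pilot.py.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class PilotDecision:
    # Hypothetical shape: task id -> retry strategy, plus task ids to skip.
    retry: Dict[str, str] = field(default_factory=dict)
    skip: List[str] = field(default_factory=list)


def parse_pilot_decision(raw: str, known_task_ids: Set[str]) -> Optional[PilotDecision]:
    """Return a validated decision, or None if the output cannot be trusted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    retry = data.get("retry", {})
    skip = data.get("skip", [])
    if not isinstance(retry, dict) or not isinstance(skip, list):
        return None
    if not all(isinstance(v, str) for v in retry.values()):
        return None
    # Reject decisions that reference tasks the orchestrator does not know.
    if not set(retry) <= known_task_ids:
        return None
    if not all(isinstance(s, str) and s in known_task_ids for s in skip):
        return None
    return PilotDecision(retry=dict(retry), skip=list(skip))


def decide(raw: str, known_task_ids: Set[str], replan: Callable[[], object]) -> object:
    """Apply the pilot decision when valid; otherwise fall back to replan()."""
    decision = parse_pilot_decision(raw, known_task_ids)
    return decision if decision is not None else replan()
```

Keeping validation in the orchestrator means a malformed or hallucinated response degrades to the pre-pilot behaviour instead of being applied.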
Supersedes the gates + gate pilot approach. Simplified to 5 steps: classify → plan → execute → verify → fix-or-replan.

Key decisions:
- Single-task is a valid plan (no forced decomposition)
- Product artifacts at project root (not otto_arch/)
- Persistent context.md accumulates across tasks
- Vertical slices over horizontal layers
- User journeys from user's perspective, not feature list
- Fix rounds continue while making progress, replan on planning failures
- Codex-reviewed design

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
logpie added a commit that referenced this pull request on Apr 26, 2026
Closes 1 CRITICAL + 2 IMPORTANT findings from the mc-audit hunters.

CRITICAL closed:
- Logs appended unbounded into client state; 10MB log could lock the browser (codex-long-string-overflow #1). Replaced with bounded ring buffer (1MB max), separate unbounded totalBytes/totalLines counters, droppedBytes tracker for the elided-bytes header.

IMPORTANT closed:
- LogPane didn't distinguish "Live, polling" from "Final" state (codex-evidence-trustworthiness #5). Header now shows live polling cadence + last-update age, OR final size + line count.
- Missing log file rendered as generic "waiting for output"; fetch errors toasted while polling kept hammering (codex-error-empty-states #9). Now shows the path explicitly, plus an error state with Retry button. Polling pauses when log is missing/errored.

Polling resilience:
- Exponential backoff on consecutive errors: 1.2s → 2s → 5s → 15s → 30s.
- Resets to 1.2s on first successful read.
- Stops polling when run is terminal AND fully drained (uses new server `eof` field).
- Pauses polling when inspector is closed or tab is hidden (visibilitychange listener); resumes on visible.

Server changes:
- `LogReadResult` gains `total_bytes` (file size at read time) and `eof` (whether next_offset == total_bytes after this slice). All three constructor sites populated. Lets the client render "Final · {size}" headers and detect drain without a second HEAD request.
- `LogsResponse` TS type updated.

Tests:
- `tests/browser/test_log_buffering.py`: 7 paired Playwright tests:
  - 5MB log renders <1.5MB DOM with elided-bytes header
  - Live state + polling header for active runs
  - Final state + line count for terminal runs
  - Missing-file path display + paused polling
  - Error backoff schedule (gap_first ≈ 2s, gap_second ≈ 5s)
  - Polling stops on inspector close
  - Polling stops on tab hidden
- Browser suite: 15 passed (7 new + 7 cluster A + 1 smoke)
- Default suite: 1076 passed (no regressions)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS

Note: pre-existing basedpyright warnings in service.py around `_record_event` calls (lines 393-544) are not introduced by this commit; they predate cluster B and are flagged because basedpyright now analyzes the file when it's touched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
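The bounded buffer with separate unbounded counters can be illustrated as follows. The real change lives in the TypeScript client; this Python sketch only mirrors the idea, and `BoundedLogBuffer` and its attribute names are hypothetical.

```python
from collections import deque


class BoundedLogBuffer:
    """Keep at most max_bytes of the newest log data; count everything seen."""

    def __init__(self, max_bytes: int = 1_000_000) -> None:
        self.max_bytes = max_bytes
        self._chunks = deque()   # retained byte chunks, oldest first
        self._retained = 0       # bytes currently held in _chunks
        self.total_bytes = 0     # unbounded counter: every byte ever appended
        self.total_lines = 0     # unbounded counter: every newline ever appended
        self.dropped_bytes = 0   # bytes evicted, for an "elided" header

    def append(self, chunk: bytes) -> None:
        self.total_bytes += len(chunk)
        self.total_lines += chunk.count(b"\n")
        self._chunks.append(chunk)
        self._retained += len(chunk)
        # Evict from the oldest end until the retained size fits the cap.
        while self._retained > self.max_bytes and self._chunks:
            excess = self._retained - self.max_bytes
            head = self._chunks.popleft()
            if len(head) > excess:
                # Trim only part of the oldest chunk and keep its tail.
                self._chunks.appendleft(head[excess:])
                dropped = excess
            else:
                dropped = len(head)
            self._retained -= dropped
            self.dropped_bytes += dropped

    def contents(self) -> bytes:
        return b"".join(self._chunks)
```

Because the counters are tracked independently of the retained data, a header can still report the true log size after older bytes have been evicted.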
logpie added a commit that referenced this pull request on Apr 26, 2026
CLUSTER C: bundle/build integrity (closes 1 CRITICAL + 4 IMPORTANT + 1 NOTE)

CRITICAL closed:
- `app.py:35`: `otto web` served checked-in bundle without verifying it matches source; developer skips `npm run web:build` → silent stale UI (codex-packaging-bundle #1).

IMPORTANT closed:
- Python build never built frontend; `pip install -e .` shipped whatever static was in tree (#2). Vite plugin emits `build-stamp.json` on every build with source-hash + timestamp + git commit; FastAPI startup verifies freshness via `verify_bundle_freshness()`.
- `web:build` ran only Vite; `web:typecheck` advisory; no CI gates (#3). New `web:verify` script chains typecheck → build → committed-check.
- Default cache headers underused hashed assets (#4). New `_CacheHeaderStaticFiles` subclass: `no-store` for shell + index.html, `public, max-age=31536000, immutable` for `/static/assets/*`.
- Server didn't validate `index.html` referenced JS/CSS exist (#5). Startup parses index.html and asserts every referenced static path resolves; missing → fail-fast with `npm run web:build` guidance.

NOTE closed:
- `[tool.setuptools.package-data]` was flat (`static/*`, `static/assets/*`); future nested assets (fonts/, images/, locale/) would silently miss wheels. Now `static/**/*` recursive glob.

Files: otto/web/bundle.py (new, 263 lines), otto/web/client/vite.config.ts (build-stamp emitter), scripts/build_stamp.py (CLI for manual stamp), scripts/check_bundle_committed.py (git-diff guard for CI), package.json (web:verify script), pyproject.toml (recursive package_data), otto/web/app.py (verify_bundle_freshness call + _CacheHeaderStaticFiles).

Tests: tests/test_web_bundle_freshness.py (5 tests) + tests/test_web_cache_headers.py (2 tests, 8 actual checks). 15 new server-layer tests, all green.
CLUSTER D: history pagination (closes 1 CRITICAL + 2 IMPORTANT)

CRITICAL closed:
- `total_rows=247` displayed but only first page rendered; power user with 200+ runs stuck on page 1 (heavy-user, codex-state-management #6, codex-long-string-overflow #3).

IMPORTANT closed:
- `/api/state` accepted `history_page` + `page_size`; client never sent `history_page` and rendered no controls.
- Page-size selector now lives in the UI (10/25/50/100, default 25); server clamps to [1, 200] to refuse stale URLs requesting unbounded slices.

Implementation:
- `MissionControlFilters.history_page_size: int | None` for per-request override (server clamps to safe range).
- App.tsx History pane: pagination footer (Page N of M · X runs · ←/→ · jump-to + page-size selector). URL persists `hp` + `ps` query params.
- Filter changes reset page to 1.
- Stale deep-link `?hp=99` (out-of-range) → "Page 99 doesn't exist; jump to page 1" with reset button.

Files: otto/mission_control/{model.py,serializers.py} (history_page_size plumbing), otto/web/app.py (param wiring), otto/web/client/src/{App.tsx, api.ts,types.ts,styles.css} (pagination UI), tests/browser/test_history_pagination.py (10 paired Playwright tests, all green).

Verification:
- Browser suite: 25 passed (8 cluster A + 7 cluster B + 10 cluster D + new smoke set from cluster C cache headers via tests/test_web_cache_headers.py)
- Default suite: 1091 passed (was 1076; +15 cluster C server tests)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS (rebuilt with stamp)

Note: cluster C agent hit an API overload during its final summary step but all files landed cleanly on disk; verified by independently running the new test files before committing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
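The page-size clamp and the out-of-range deep-link case from cluster D can be sketched as below. The [1, 200] range and the default of 25 come from the commit message; `clamp_page_size` and `page_window` are hypothetical helper names, not the functions in otto/mission_control.

```python
import math
from typing import Optional


def clamp_page_size(requested: Optional[int], default: int = 25,
                    lo: int = 1, hi: int = 200) -> int:
    """Clamp a client-supplied page size into the safe range."""
    if requested is None:
        return default
    return max(lo, min(hi, requested))


def page_window(total_rows: int, page: int, page_size: Optional[int]) -> dict:
    """Compute the row slice for a 1-indexed page, flagging pages that do not exist."""
    size = clamp_page_size(page_size)
    total_pages = max(1, math.ceil(total_rows / size))
    exists = 1 <= page <= total_pages
    start = (page - 1) * size if exists else 0
    end = min(start + size, total_rows) if exists else 0
    return {
        "page_size": size,
        "total_pages": total_pages,
        "exists": exists,  # False lets the UI offer a "jump to page 1" reset
        "start": start,
        "end": end,
    }
```

With 247 rows and a page size of 25 there are 10 pages, so a stale `?hp=99` link is reported as non-existent rather than silently rendering an empty slice.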
logpie added a commit that referenced this pull request on Apr 26, 2026
CLUSTER E: diff freshness contract (closes 1 CRITICAL + 1 IMPORTANT)

CRITICAL closed:
- Diff fetched once and held client-side without target/branch SHAs or merge-base; **the code merged could differ from the diff reviewed** (codex-evidence-trustworthiness #1).

IMPORTANT closed:
- Diff truncation was bare `truncated` suffix; user couldn't tell how much was hidden, no full-diff download path (codex-evidence-trustworthiness #2).

Implementation:
- `MissionControlService.diff()` enriches response with: `fetched_at`, `target_sha`, `branch_sha`, `merge_base`, `command`, `limit_chars`, `full_size_chars`, `shown_hunks`, `total_hunks`, `errors[]`.
- `MissionControlService._validate_expected_diff_shas`: merge action rejects with 409 + "Re-fetch the diff to confirm what will be merged" when `expected_target_sha` / `expected_branch_sha` differ from live HEAD.
- POST /api/runs/{id}/actions/{action} forwards SHAs from request body.
- DiffPane renders freshness header (captured-X-ago + target/branch/base short SHAs with full-SHA tooltip; warnings when SHAs are null) + Refresh button + truncation banner ("Showing N hunks of M · X KB of Y MB" + Copy diff command).
- Merge confirm dialog spells out "Land branch {short} @ {sha} into target {short} @ {sha}" with the actual SHAs.
- SPA passes SHAs from most-recent diff fetch on every merge POST.

CLUSTER F: boot-loading gate + first-run clarity (closes 2 CRITICAL + ~12 IMPORTANT)

CRITICAL closed:
- App.tsx tri-state boot gate (`loading | launcher | ready`): main shell no longer renders before /api/projects returns; "New job" button can no longer be enabled with project undefined (codex-first-time-user #1).
- Pre-submit advanced-options summary in JobDialog: "Will run with: claude · sonnet · effort=high · verification=fast" outside Advanced details with "Edit" link (codex-first-time-user #2).

IMPORTANT closed:
- Launcher subhead: "Otto runs AI coding jobs in isolated git worktrees, then lets you review logs, diffs, and merge results."
- "Managed root" helper text explains current-repo isolation
- Empty project list: "Create your first Otto project below" + auto-focus
- First-run primary CTA: "Start first build" (reverts to "New job" once any run exists)
- Build/Improve/Certify dropdown options gain helper descriptions
- All commands now require non-empty intent or focus (was build-only)
- Dirty-project confirm lists up to 5 dirty files with "+N more"
- "Start queued job" CTA when watcher is stopped + jobs queued
- Empty detail copy: "Select a task card to review logs, code changes, verification, and next action."
- RunInspector tab labels: jargon-soft alternatives
- Recovery actions surfaced as primary contextual buttons (Retry / Resume / Cleanup) next to run header; Advanced still has full list
- HTTP-code-to-actionable-copy mapping in api.ts: 409/400/403/5xx → recovery messages (no more raw "HTTP 409")

Tests:
- tests/test_diff_freshness.py: 6 server tests
- tests/browser/test_diff_freshness.py: 5 browser tests
- tests/browser/test_first_run_clarity.py: 13 browser tests
- All green; default suite 1097 (was 1091; +6 cluster E server)
- Browser suite 47 (was 25; +5 cluster E + +17 cluster F)
- npm web:typecheck clean; bundle rebuilt

Followups for orchestrator:
- SHA-mismatch refusal is opt-in by client (older callers omit SHAs and bypass the gate); consider promoting to power-user opt-out flag
- Pre-fetch diff inside runActionForRun("merge") so SHA gate is non-bypassable from SPA
- Related provenance gaps (proof drawer cache, ArtifactRef metadata, visual-evidence manifest, proof file digest) share same architectural fix; bundle into a future cluster

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
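A minimal sketch of the cluster E SHA gate, assuming the expected SHAs arrive as optional strings. `StaleDiffError` and the standalone function below are hypothetical stand-ins for `MissionControlService._validate_expected_diff_shas`, whose real signature is not shown in this commit message.

```python
from typing import Optional


class StaleDiffError(Exception):
    """Raised when the reviewed diff no longer matches live refs (maps to HTTP 409)."""

    status_code = 409


def validate_expected_diff_shas(
    expected_target_sha: Optional[str],
    expected_branch_sha: Optional[str],
    live_target_sha: str,
    live_branch_sha: str,
) -> None:
    checks = (
        ("target", expected_target_sha, live_target_sha),
        ("branch", expected_branch_sha, live_branch_sha),
    )
    for label, expected, live in checks:
        # A caller that omits a SHA skips this check (the opt-in gap noted
        # in the followups of the commit message).
        if expected is not None and expected != live:
            raise StaleDiffError(
                f"{label} moved from {expected[:7]} to {live[:7]}. "
                "Re-fetch the diff to confirm what will be merged"
            )
```

Making the gate non-bypassable would mean treating a missing SHA as a failure instead of a skip, which is the direction the followups suggest.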
logpie added a commit that referenced this pull request on Apr 26, 2026
…/launcher/lost-connection)

============================================================
5 IMPORTANT closures
============================================================

W7-IMPORTANT-1: iPhone submit button below the fold
  In devices["iPhone 14"] viewport (390x664), JobDialog submit at y=674 was clipped without scroll affordance. JobDialog scroll-container now uses max-height: 100dvh - 80px and overflow-y: auto; submit button always reachable via scroll OR the dialog renders a floating bottom-bar on narrow viewports.

W8-IMPORTANT-1: JobDialog ignores Cmd+Enter from textarea
  Power-user shortcut was dead. Added onKeyDown to intent textarea: (Cmd|Ctrl)+Enter now triggers submit when validation passes.

W9-IMPORTANT-2: Run double-rendered after terminal_outcome
  Live[]/history[] transition not atomic: same run_id appeared in both, UI rendered twice. Client-side dedupe: when computing rows to render, exclude live items whose run_id appears in history with terminal_outcome set.

Codex first-time-user #4: "Managed root" looked like current repo disappeared
  Launcher panel adds: "Otto manages projects in isolated git worktrees so it never touches your other repos. Pick or create one below to start."

Codex error-empty-states #1: Lost-connection banner
  When polling fails 3+ consecutive times, sticky banner appears: "Lost connection to Mission Control. Retrying every 5s..." with manual retry button. Auto-clears on first successful poll after.

============================================================
Tests added (14 new browser tests)
============================================================

- tests/browser/test_iphone_submit_button_reachable.py (3 tests)
- tests/browser/test_job_dialog_cmd_enter.py (3 tests)
- tests/browser/test_no_double_render_after_terminal.py (4 tests)
- tests/browser/test_launcher_managed_root_explanation.py (1 test)
- tests/browser/test_lost_connection_banner.py (3 tests)

============================================================
Test counts
============================================================

- Default: 1189 (no change; all UI fixes)
- Browser: 198 effective (was 184; +14 new)
- npm web:typecheck: clean
- npm web:build: clean

============================================================
Tally
============================================================

CRITICAL: 29 of 29 closed (100%)
IMPORTANT: ~113 of 132 closed (~86%)

Followups:
- ~19 NOTE-tier IMPORTANTs remain (paper cuts; deferred per severity-gate rule unless adjacent)
- 76 NOTE items deferred per policy
- Phase 3.5 R1-R14 actual recordings ($70-140, hours of real LLM) remain scaffolded but not captured

This effectively closes the bulk of Phase 4 IMPORTANT work. Branch is in releasable state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
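The W9 dedupe rule (drop live items whose run_id already appears in history with terminal_outcome set) can be sketched in a few lines. The actual fix is in the TypeScript client; `rows_to_render` and the dict-shaped rows are assumptions made for illustration.

```python
from typing import Dict, List


def rows_to_render(live: List[Dict], history: List[Dict]) -> List[Dict]:
    """Merge live and history rows without rendering a finished run twice."""
    # Only history rows with a terminal outcome shadow their live twin.
    terminal_ids = {
        row["run_id"] for row in history if row.get("terminal_outcome")
    }
    still_live = [row for row in live if row["run_id"] not in terminal_ids]
    return still_live + list(history)
```

A history row without a terminal outcome does not suppress the live row, so a run that is still being written is never hidden by a half-populated history entry.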
logpie added a commit that referenced this pull request on May 6, 2026
The architectural design doc (docs/intent-to-product-design.md) was the canonical reference that drove the redesign — adding it as item #1 in the read-in-this-order list ahead of progress.md / research.md / plan.md. Fixes the duplicate "4." numbering left by the prior edit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request on May 16, 2026
Three deterministic RED-first repros for the concurrency + recursive-decomposition regime, exercising real code paths (only the agent step is an injected callable; no LLM, no spec-compile, no 90-min run; <5s total):

- #1 depth-3 dual-subtree concurrent propagation: GREEN. Owning-worktree propagation (regression guard for 89d4bad) + blocked-slice isolation with structured reason are sound. Banked as permanent regression.
- #2 5-way concurrent merge into one integration branch: RED. One-shot union repair against a moving integration target drops already-landed sibling contributions (final routes [route-0,route-2], expected all 5) instead of bounded re-entry. Composition of flock + union guard + seam re-entry under 5-way race is broken.
- #3 task_graph/spec-state concurrent terminal writes (32 children): RED. spec_state.append_event derives event_id from an unlocked line-count read (spec_state.py:328-333), so concurrent writers stale-read the same count and produce duplicate/skewed event_ids, violating the documented stable-unique contract amendments rely on (trigger_event_id linkage).

No production code changed; triage/repro pass only. Fixes dispatched next. Repros built by Codex; root causes confirmed against real spec_state code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
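The race in repro #3 comes from counting lines and appending in two separate, unlocked steps. One conventional way to close that kind of window is to hold an exclusive file lock across both the count and the write, as sketched below. This is not the fix that was dispatched (this commit changes no production code), and `append_event` here is a hypothetical stand-in for the real spec_state function. It assumes a Unix platform where `fcntl.flock` is available.

```python
import fcntl
import json
import os


def append_event(path: str, payload: dict) -> int:
    """Append one JSONL event whose event_id is derived under an exclusive lock."""
    with open(path, "ab+") as fh:
        # Block until no other writer holds the lock on this file.
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            fh.seek(0)
            # The count and the write happen inside one critical section,
            # so two writers can never observe the same line count.
            event_id = sum(1 for _ in fh) + 1
            record = {"event_id": event_id, **payload}
            fh.write((json.dumps(record) + "\n").encode("utf-8"))
            fh.flush()
            os.fsync(fh.fileno())
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
    return event_id
```

Re-counting lines on every append is linear in file size; a production version might persist a counter instead, but the locking shape would be the same.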
logpie added a commit that referenced this pull request on May 16, 2026
…rm contract-write invariant

Plan-Gate-APPROVED redesign step S1 (builds on S0 primitive). Flips repro scene #1 GREEN; scenes #2-#5 stay RED (S1 flips only #1).

- Scheduler-ordering invariant in `_process_children`: no `feature` child dispatches while a sibling `foundation` task for that parent is pending/in-flight/unverified or the parent foundation_contracts are absent/invalid, even without `depends_on` (scheduler invariant, not prompt).
- Isolation gate beside `check_route_registration_isolation`: pre feature-dispatch, every foundation_contract path must be exclusively owned; an overlapping/nested feature owned_path → re-enter the architect via the existing `_reenter_or_block_architect_contract` bounded machinery with `kind="shared_foundation_not_isolated"`; bounded exhaustion → structured terminal (no crash, no silent dispatch).
- Uniform contract-write invariant: a task may write a foundation_contract path only if it is the contract's owner_task_id or a contract_amendment for it (S1 emits a structured `foundation_contract_write_blocked` and does not advance the branch; S2 adds the amendment routing). ONE shared helper applied at ALL 8 enumerated v5 commit/merge admission hooks (preflight repair, integ-agent commit, root-inline, subtree-prop repair, child-verify repair, scaffold repair, _merge_child_branch pre-commit dirty/untracked + pre-merge committed child-branch delta, merge-conflict repair).
- Plan-Gate must-have: `_task_entry_allows_upward_merge` / `_child_result_allows_upward_merge` now consult durable merge_blocked graph state so a stale in-memory `pass` cannot bypass the S1 gate.
- Secondary: legacy `detect_scope_violations` treats a newly-created critical shared-contract path as a violation (defense-in-depth; v5 gate is primary).

Verified: scene #1 GREEN, #2-#5 RED; 19 S1+S0 units GREEN; broad v5/seam suite shows ONLY the 4 known pre-existing test_v5_phase2 git-worktree-rot failures (no new); ruff clean. S0 untouched.

Codex-implemented; Claude-reviewed (scope, no S2-S5 leak, scene-#1 not-gamed, stale-pass must-have, 8-hook enumeration).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request on May 16, 2026
…duler hold, honest terminal

S1 Gate R1 (3 CRITICAL + 2 IMPORTANT; 8-hook enumeration + no-S2-leak confirmed):

- CRITICAL-1: scheduler allowed feature dispatch when foundation contracts not yet declared even with a foundation sibling. Now an explicit task_role=="foundation" sibling HOLDS all feature dispatch until every foundation sibling is mergeable/verified AND valid parent foundation_contracts are present; pure-feature decomposition (no foundation sibling) still bypasses.
- CRITICAL-2: terminal-blocked foundation silently stranded features (dropped from ready + loop break). Now affected ready/pending feature children are honestly marked merge_blocked kind="foundation_unsatisfied" (no silent drop).
- CRITICAL-3 (real capstone root cause): isolation gate only caught exact contract-path overlap, NOT a feature owned_path nested under the foundation owner's broad tree (foundation owns backend/, contract backend/auth.py, feature owns backend/routers/auth.py). Now a foundation-owned tree covering a declared contract is exclusive; nested sibling feature paths are rejected via the existing shared_foundation_not_isolated architect re-entry. Repro scene #1 strengthened to the nested capstone shape (verified RED on old db5b819, GREEN on fix; genuine, not gamed).
- IMPORTANT-4: `_task_entry_allows_upward_merge` now also rejects a stale verdict=="pass" carrying durable merge_blocked metadata (was only fixed in `_child_result_allows_upward_merge`).
- IMPORTANT-5: removed the over-broad any-contract_amendment write allow (S2 will reintroduce a bound amendment→contract allow); integration-of-record allow explicitly deferred, not left ambiguous.

Verified: scene #1 GREEN (nested shape), #2-#5 RED; 21 S1+S0 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 git-worktree-rot failures; ruff clean. S0 untouched.

Codex-fixed (Codex-found via Impl Gate R1); Claude-reviewed + RED-on-old verified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
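The CRITICAL-3 case (foundation owns backend/, contract is backend/auth.py, feature owns backend/routers/auth.py) can be illustrated with a small path-coverage check. This is a sketch of the rule as described, not the gate in the v5 runner; `_covers` and `isolation_violations` are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import List


def _covers(tree: str, path: str) -> bool:
    """True if path is tree itself or lies anywhere beneath it."""
    t, p = PurePosixPath(tree), PurePosixPath(path)
    return p == t or t in p.parents


def isolation_violations(
    foundation_owned: List[str],
    contracts: List[str],
    feature_owned: List[str],
) -> List[str]:
    """Return feature paths that intrude on foundation-exclusive territory."""
    # A foundation-owned tree that covers any declared contract is exclusive.
    exclusive = [
        tree for tree in foundation_owned
        if any(_covers(tree, contract) for contract in contracts)
    ]
    # The contract paths themselves are always exclusive.
    exclusive += contracts
    return sorted({
        feature
        for feature in feature_owned
        for zone in exclusive
        if _covers(zone, feature) or _covers(feature, zone)
    })
```

Checking only the contract paths (the second half of `exclusive`) reproduces the pre-fix behaviour: backend/routers/auth.py does not overlap backend/auth.py, so the nested feature path slipped through.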
logpie added a commit that referenced this pull request on May 16, 2026
…er / honest terminal)

S1 Gate R2 (2 CRITICAL; nested-tree fix + IMPORTANT-4/5 confirmed correct R2, untouched). Both were the same disease: a feature child silently stranded pending instead of an honest structured terminal (the capstone hang class).

- CRITICAL-1: foundation PASSED but never produced valid foundation_contracts → features were silently held/dropped with no re-entry or terminalization (the S1 test masked it by injecting contracts externally between _process_children calls). Now: passed/mergeable foundation + absent/invalid contracts re-enters the architect via the existing bounded `_reenter_or_block_architect_contract` machinery (kind=foundation_contracts_missing_after_pass); on bounded exhaustion the dependent features are honestly merge_blocked, never left pending awaiting external metadata mutation. Test rewritten to assert the real runtime transition (no external contract injection).
- CRITICAL-2: terminalization only ran when ready_features non-empty, but a depends_on=["foundation"] feature is NOT ready when foundation failed (deps unsatisfied), so it stayed silently pending. Now terminal-foundation handling scans ALL same-parent unmerged feature siblings (not just ready) and marks each merge_blocked kind="foundation_unsatisfied"; covers foundation merge_blocked AND catastrophic/failed.

Tradeoff: terminal handling uses task-graph siblings as source of truth (not orphan pending JSONL), consistent with the existing scheduler/metadata model, no new lifecycle channel.

Verified: scene #1 GREEN (nested capstone shape), #2-#5 RED; 22 S1+S0 units GREEN incl. 2 strengthened/added scheduler tests asserting real runtime transitions; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. S0 + nested-tree fix untouched.

Codex-fixed (Codex-found via Impl Gate R2); Claude-reviewed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request on May 16, 2026
Plan-Gate-APPROVED redesign step S2 (builds on S0/S1). Flips repro scene #5 GREEN (full lifecycle); scenes #2/#3/#4 stay RED.

- _merge_child_branch union-feedback: when the union/conflict path is a declared foundation_contract the child does NOT own, no longer routes the repair to the leaf (the b15 scope-gate deadlock). Instead schedules a runnable task_role=="contract_amendment" task owned by the contract's owner_task_id, owned_paths=[contract], emits foundation_contract_amendment_repair.
- Net-new lifecycle (task_graph): set_contract_amendment_blocked records last_agent_verdict, CLEARS verdict/completed_at (un-non-runnable), sets non-terminal blocked_pending_contract_amendment + blocked_on_task_id; clear_contract_amendment_blocked_tasks clears ALL leaves blocked on an amendment (Plan-Gate must-have #3, not just the first).
- take_ready: new blocked_on_task_id gate (analogous to depends_on): a leaf with an unsatisfied blocked_on is skipped, not dispatched/terminal.
- Amendment terminal-PASS → clear all blocked leaves + re-enqueue merge-only retry (scheduler re-entry w/ contract_amendment_retry_merge metadata, reuses pending/lease machinery, bypasses Lead, retries only _merge_child_branch). Amendment terminal-FAIL → each leaf honest merge_blocked (no silent hang).
- Reintroduced the BOUND contract_amendment write-allow S1 removed: an amendment may write only its bound contract (owner/path match via task metadata), not any contract.
- Blocked graph state authoritative over stale in-memory LeadResult(pass) (Plan-Gate must-have #2; composes with S1's hardening).

Verified: scene #5 strengthened to full-lifecycle assertion (verified RED on pre-S2 78535d1, GREEN now; not gamed); scenes #1/#5 GREEN, #2/#3/#4 RED; 27 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. S0/S1 untouched.

Codex-implemented; Claude-reviewed (scope, lifecycle, RED-on-old).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
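The `blocked_on_task_id` gate in take_ready, described as analogous to depends_on, might look roughly like the sketch below. The dict-based task graph and this simplified `take_ready` signature are assumptions; in the real lifecycle the amendment's terminal-PASS clears the blocked flag, whereas this sketch simply treats a passed blocker as satisfied.

```python
from typing import Dict, List, Set


def take_ready(tasks: Dict[str, dict], in_flight: Set[str]) -> List[str]:
    """Return ids of tasks that may be dispatched right now."""
    passed = {tid for tid, t in tasks.items() if t.get("verdict") == "pass"}
    ready = []
    for tid, task in tasks.items():
        # Already running or already terminal: not dispatchable.
        if tid in in_flight or task.get("verdict") is not None:
            continue
        # Classic dependency gate.
        if not set(task.get("depends_on", [])) <= passed:
            continue
        # Amendment gate: skip (do not terminalize) while the blocker is pending.
        blocker = task.get("blocked_on_task_id")
        if blocker is not None and blocker not in passed:
            continue
        ready.append(tid)
    return ready
```

The blocked leaf stays non-terminal the whole time, which is what lets it be retried once the amendment lands instead of being re-planned from scratch.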
logpie added a commit that referenced this pull request on May 16, 2026
…tlement, bound writes, bounded churn

S2 Gate R1 (2 CRITICAL + 1 IMPORTANT + 1 NOTE), the silent-hang / double-merge class:

- CRITICAL-1: merge-only retry never persisted the restored terminal verdict (only in-memory) → after restart the graph had verdict=None and re-dispatched the leaf (double-merge); the scene masked it via a fake set_verdict. Now the restored `pass` is persisted ONLY after _merge_child_branch really succeeds AND a durable graph re-read shows no fresh block/retry/merge_blocked; idempotent, no restart double-merge.
- CRITICAL-2: an amendment _run_child CRASH set it catastrophic without running fail-settlement → blocked leaves kept blocked_on_task_id forever (take_ready skips them = silent hang). Now ANY amendment terminalization (crash/catastrophic/failed/merge_blocked) runs _settle_contract_amendment_dependents → every blocked leaf becomes honest merge_blocked.
- IMPORTANT-3: bound write-allow still let a contract_amendment task modify arbitrary NON-contract files (gate only flagged contract-overlapping paths). Now a contract_amendment task may write ONLY its bound contract path; any other changed path is rejected.
- NOTE-4: futile amendment churn was unbounded (pass-without-fix → schedule another amendment forever). Now bounded per (leaf, contract) (cap=2: initial + 1 retry, matching existing bounded-retry style) → honest structured merge_blocked on exhaustion.

+4 regressions: durable verdict after real (non-fake) retry + no re-dispatch; amendment crash settles all blocked leaves; amendment cannot write a non-contract file; futile-amendment bounded → terminal.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 30 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. S0/S1 untouched.

Codex-fixed (Codex-found via Impl Gate R1); Claude-reviewed.

Tradeoff: amendment retry cap=2 (small bounded style).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
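The NOTE-4 bound (cap=2 per (leaf, contract): initial attempt plus one retry) amounts to a keyed attempt counter. A minimal sketch, with `AmendmentBudget` as a hypothetical name and an in-memory counter standing in for whatever durable state the task graph actually uses:

```python
from collections import Counter

AMENDMENT_CAP = 2  # initial attempt + 1 retry, per the commit message


class AmendmentBudget:
    """Track how many amendments were scheduled per (leaf, contract) pair."""

    def __init__(self, cap: int = AMENDMENT_CAP) -> None:
        self._cap = cap
        self._used = Counter()

    def try_schedule(self, leaf_id: str, contract_path: str) -> bool:
        """Consume one attempt; False means the caller should terminalize."""
        key = (leaf_id, contract_path)
        if self._used[key] >= self._cap:
            # Budget exhausted: caller marks the leaf merge_blocked.
            return False
        self._used[key] += 1
        return True
```

Keying on the pair rather than the leaf alone means a leaf that conflicts on a second, different contract still gets a fresh budget for it.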
logpie added a commit that referenced this pull request on May 16, 2026
…durable in-progress state)

S2 Gate R2 (1 IMPORTANT; C2/I3/N4 confirmed correct R2). The merge-only-retry flag was cleared BEFORE the merge, leaving a window (verdict=None, blocked_on=None, retry=False) where a crash/restart or a second runner could re-dispatch the leaf via take_ready (in-process lease only) → double-merge class.

Fix: durable `contract_amendment_retry_in_progress` set atomically when entering merge-only retry (no longer pre-clears contract_amendment_retry_merge); `take_ready` treats it non-runnable so empty-in-flight / crash-restart / second-runner cannot re-dispatch the leaf as an ordinary task; cleared ONLY atomically (single graph lock/write) with the terminal outcome:
- success persists `pass` + both flags;
- merge_blocked persists terminal + flags;
- fresh re-block clears stale retry flags atomically with blocked_on_task_id (preserving last_agent_verdict).

Fails-closed during the window; idempotent on restart (resume/settle, never double-merge).

+1 regression: simulates the exact in-retry window pre-durable-pass and asserts fresh take_ready(in_flight=set()) does NOT return the leaf.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 31 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. S0/S1 + R1 fixes (C2/I3/N4) untouched.

Codex-fixed (Codex-found via Impl Gate R2); Claude-reviewed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request on May 16, 2026
…le-recovery

S2 Gate R3 (2 CRITICAL): the R2 durable-in-progress fix closed the in-process window but (1) left the second-runner race open (mark_..._in_progress wasn't compare-and-set; _run_child ignored its return) and (2) introduced a crash/restart DEADLOCK (stale in-progress, no recovery; traded double-merge for permanent stuck).

- Atomic claim: `mark_contract_amendment_retry_in_progress` is now a compare-and-set under the existing `_locked_graph()` fcntl.LOCK_EX. It flips in_progress=True only if still retry-merge/unblocked/non-terminal and unclaimed-or-stale-with-budget; persists owner token/pid/host/heartbeat/claim-count/merge-context. `_run_child` consumes the return: False → does NOT run _merge_child_branch (yields to the owner; no double-merge, no terminalize-of-a-live-owner). One active merger at a time, cross-process.
- Bounded stale-recovery: stale = same-host owner pid gone OR heartbeat/start exceeds the bounded timeout. take_ready reopens ONLY stale retry-merge entries as merge-only retries (never ordinary Lead dispatch); remaining claim budget → reclaim+resume from durable contract_amendment_merge_context; budget exhausted → structured merge_blocked. Composes with N4's per-(leaf,contract) cap. Never deadlocks, never double-merges, never re-dispatches as ordinary.

Net invariant: exactly one runner executes a leaf's merge-only retry at a time; crash/restart always resolves to pass or honest merge_blocked within bounded attempts.

+2 regressions: concurrent-claim race (exactly one wins, loser doesn't merge); stale in-progress recovery (resume→pass or bounded→merge_blocked, never ordinary, never stuck). R2 restart-window + durable-verdict regressions still pass.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 33 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. R1 (C2/I3/N4) + S0/S1 untouched.

Codex-fixed (Codex-found via Impl Gate R3); Claude-reviewed. (Codex sub-docs research/plan-s2-amendment-retry-recovery.md included.)

Tradeoff: conservative remote-host staleness handled via timeout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
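The compare-and-set claim with bounded stale-recovery can be sketched as a single check-then-update inside one critical section. Everything here is illustrative: a `threading.Lock` stands in for the `_locked_graph()` file lock, the entry is a plain dict, the field names are shortened, staleness is reduced to the heartbeat timeout (the same-host pid check is omitted), and the stale timeout and claim budget values are assumptions.

```python
import threading

_graph_lock = threading.Lock()  # stand-in for the cross-process graph lock


def try_claim(entry: dict, owner: str, now: float,
              stale_after: float = 900.0, max_claims: int = 2) -> bool:
    """Claim a merge-only retry; True only for the single winning runner."""
    with _graph_lock:
        # Only non-terminal entries still flagged for a retry-merge qualify.
        if entry.get("verdict") is not None or not entry.get("retry_merge"):
            return False
        if entry.get("in_progress"):
            # A live owner keeps its claim; only a stale one may be replaced.
            if now - entry["heartbeat"] <= stale_after:
                return False
        if entry.get("claim_count", 0) >= max_claims:
            # Budget exhausted: real code would terminalize to merge_blocked.
            return False
        entry.update(
            in_progress=True,
            owner=owner,
            heartbeat=now,
            claim_count=entry.get("claim_count", 0) + 1,
        )
        return True
```

The caller must act on the return value: a runner that gets False does not run the merge, which is the part the R2 version missed.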
logpie added a commit that referenced this pull request on May 16, 2026
…(no false reclaim)

S2 Gate R4 (final round, 1 minimal must-fix; everything else confirmed acceptable, residual NOTE-level). Heartbeat was written only at claim time, never refreshed, but _merge_child_branch can legitimately run ~1800s > the 15-min stale timeout → a LIVE long-running retry owner was falsely reclaimed by a second runner (the exact race R3 closed, reopened by long merges).

Fix: owner-token-checked periodic heartbeat refresh (60s interval, well under the 15-min stale window) wrapping the awaited _merge_child_branch in the merge-only retry path. The refresher writes the heartbeat under _locked_graph() ONLY when owner==this child_session_id AND retry_in_progress AND retry_merge AND no terminal/blocked state landed (re-checked each tick; stops if owner/state no longer matches). try/finally cancels + awaits it (suppress CancelledError) on success/merge_blocked/re-block/exception; no leaked task, no post-terminal refresh. Dead/stalled owners still go stale and are bounded-recovered via the existing timeout (unchanged).

+1 regression: live long-running heartbeating owner is NOT reclaimed by a second claim (CAS still False); existing dead-owner stale-recovery still recovers; R2/R3 regressions still pass.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 34 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. R1/R2/R3 + S0/S1 untouched.

Codex-fixed (Codex-found via Impl Gate R4); Claude-reviewed.

Accepted NOTE-level residual: conservative remote/unknown-host stale timeout (dead remote owner waits out the bounded timeout before recovery).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
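The heartbeat wrapper described in this commit (periodic refresh while the merge is awaited, cancelled and awaited in a finally block) follows a common asyncio pattern, sketched below. `run_with_heartbeat`, the dict-shaped entry, and the `beats` counter are hypothetical; the real refresher writes under `_locked_graph()` and re-checks more state on each tick.

```python
import asyncio
import contextlib
from typing import Awaitable, Callable


async def run_with_heartbeat(
    merge: Callable[[], Awaitable[object]],
    entry: dict,
    owner: str,
    interval: float = 60.0,
) -> object:
    """Await merge() while periodically refreshing the owner's heartbeat."""

    async def refresh() -> None:
        while True:
            await asyncio.sleep(interval)
            # Stop refreshing as soon as ownership or state no longer matches.
            if entry.get("owner") != owner or not entry.get("in_progress"):
                return
            entry["heartbeat"] = asyncio.get_running_loop().time()
            entry["beats"] = entry.get("beats", 0) + 1

    refresher = asyncio.create_task(refresh())
    try:
        return await merge()
    finally:
        # Always stop the refresher so no heartbeat lands after the outcome.
        refresher.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await refresher
```

Awaiting the cancelled task in the finally block is what guarantees the refresher has fully stopped before the terminal outcome is written, rather than merely having been asked to stop.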
logpie added a commit that referenced this pull request on May 16, 2026
…loop (kills the 1799s hang)
Plan-Gate-APPROVED redesign step S4 (builds on S0/S1/S2). Directly fixes
the user's original pain: the 1799s leaf repair-agent timeout that hung
the iTracker capstone. Flips repro scene #2 GREEN; #3/#4 stay RED.
- After a SCOPED conflict repair, `_merge_child_branch` runs integration
smoke in DETECTION-ONLY mode: it no longer enters
`_run_integration_smoke_preflight_with_repair`'s leaf repair loop for an
out-of-scope / foundation clean-deploy failure. Both leaf-reachable
entry points converted: the direct post-conflict path AND the
stale-target
`_repair_stale_target_and_retry_merge(run_smoke_preflight=True)` path.
(Root/subtree integration smoke unchanged; not leaf.)
- An out-of-scope/foundation clean-deploy failure now emits a
correctly-owned foundation_repair_needed / integration_repair_needed
that creates a RUNNABLE graph task and S2-blocks the leaf (reuses S2's
set_contract_amendment_blocked lifecycle / atomic-claim /
stale-recovery; repair_route distinguishes integration_smoke_repair
from foundation contract amendments): never a dangling event, never a
1799s leaf loop.
- In-scope failures keep existing scoped repair (no behavior change).
- v5_preflight_repair: scoped leaf conflict-repair prompts no longer
demand the full acceptance oracle.
Repro scene #2 oracle refined (RED-first, verified RED on eae1f3a /
GREEN now; not weakened): asserts no leaf smoke-REPAIR loop from either
entry point, a runnable correctly-owned repair-need, leaf S2-blocked
(not merge_blocked), detection-only smoke allowed.
Verified: scenes #1/#2/#5 GREEN, #3/#4 RED; 38 S0-S2+S4 ownership units
GREEN; ruff clean; S0/S1/S2 untouched.
Codex-implemented; Claude-reviewed (RED-on-old verified; scope
confirmed).
Pre-existing rot NOTE (NOT this redesign): committed
test_v5_architect_retry.py patches otto.v5_runner.check_scaffold_compiles
which was removed by e2329e9 (pre-session "agent-native repair Step 4")
→ AttributeError on 3 tests; plus the 4 test_v5_phase2 git-worktree-rot
failures. Both predate + are unrelated to S0-S4 and are entangled with
the user's 4 uncommitted route-isolation dirty files (deliberately NOT
committed here).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request on May 16, 2026
…nded pathless terminal, scoped in-scope fallback
S4 Gate R1 (1 CRITICAL + 1 IMPORTANT). The gate caught that S4 was
broken on the REAL path (tests used pathful fakes):
- CRITICAL: CleanOracleIssue.paths were dropped by
preflight_issues_from_clean_oracle / PreflightIssue (no path field) /
smoke serialization → S4's classifier saw REAL failures as pathless →
always out-of-scope → empty-bound contract_amendment (rejects all
writes) + cap-check-key('') vs increment-key('integration_smoke
_repair') mismatch → cap never trips → the 1799s stuck-cycle
re-emerged through S2 tasks. Fixed: added optional PreflightIssue.paths
(legacy None preserved; constructors/consumers audited), threaded
CleanOracleIssue.paths → PreflightIssue.paths →
_preflight_issue_payload → _smoke_payload_paths, plus a robust
fallback reading clean_oracle_result.issues[].paths. A genuinely
pathless smoke failure now terminalizes as honest structured
merge_blocked kind="integration_smoke_unrouteable" (never an
empty-bound amendment, never uncapped). Single consistent normalized
repair_path key used for BOTH the cap check and increment.
- IMPORTANT: the in-scope leaf smoke-repair fallback entered an
UNRESTRICTED full-oracle loop (no allowed_paths/scope_policy → prompt
demanded full acceptance oracle; commit hook only foundation-gated).
Fixed: in-scope fallback now passes allowed_paths=leaf.owned_paths +
scope_policy="allowed_paths", and the repair commit hook blocks any
changed path outside that allowlist before the foundation-contract
gate. A leaf smoke-repair can never widen beyond its owned paths.
+3 real-path regressions: clean-oracle serialization preserves paths
(RED on fa5c481 — old code had no PreflightIssue.paths, classifier
returned []); pathless smoke → bounded honest terminal (no empty
amendment); in-scope fallback packet + commit-hook enforce owned_paths
(inspects the real packet, not a monkeypatched call count).
Verified: scenes #1/#2/#5 GREEN, #3/#4 RED; 41 S0-S2+S4 ownership
units GREEN; ruff clean; S0/S1/S2 untouched. The 4 test_v5_phase2 +
committed test_v5_architect_retry check_scaffold_compiles-AttributeError
failures remain PRE-EXISTING rot (unrelated, entangled with the user's
uncommitted route-isolation work; deliberately not committed).
Codex-fixed (Codex-found via Impl Gate R1); Claude-reviewed.
Tradeoff: genuinely-pathless smoke failures terminalize immediately
(honest, actionable) rather than consuming retries against a synthetic
key.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
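Both R1 fixes come down to small invariants that are easy to state in code. The sketch below is an assumption-laden illustration, not the repository's implementation: `RepairCap` and `paths_outside_allowlist` are hypothetical names. It shows a cap whose check and increment cannot disagree on the key, and an allowlist test that treats a path as in-bounds only if it equals or sits under an owned path.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, List, Optional, Tuple


class RepairCap:
    """Attempt cap whose check and increment share one normalized key."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._counts: Counter = Counter()

    @staticmethod
    def _key(leaf: str, repair_path: Optional[str]) -> Tuple[str, str]:
        # Single normalization point, so '' vs 'name' mismatches cannot occur.
        return (leaf, (repair_path or "").strip())

    def exhausted(self, leaf: str, repair_path: Optional[str]) -> bool:
        return self._counts[self._key(leaf, repair_path)] >= self.limit

    def record(self, leaf: str, repair_path: Optional[str]) -> None:
        self._counts[self._key(leaf, repair_path)] += 1


def paths_outside_allowlist(changed: Iterable[str], allowed: Iterable[str]) -> List[str]:
    """Return changed paths that are neither an allowed path nor nested in one."""
    allowed_paths = [PurePosixPath(a) for a in allowed]

    def covered(path: str) -> bool:
        p = PurePosixPath(path)
        return any(p == a or a in p.parents for a in allowed_paths)

    return sorted(c for c in changed if not covered(c))
```

Comparing path components rather than string prefixes matters here: a prefix test would wrongly accept `app/feature_ab/y.py` for an owner of `app/feature_a`.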
logpie added a commit that referenced this pull request on May 16, 2026
…from broad compile inputs)
S4 Gate R2 (1 CRITICAL; R1 pathless/cap + in-scope-scoping confirmed
CLOSED). py_compile set CleanOracleIssue.paths = ALL compiled files
(command input set), not the causal failing file → S4's
all-paths-must-overlap scope check made a leaf-owned syntax error look
out-of-scope → misrouted an in-scope leaf bug to the wrong owner /
under-scoped repair (and the first-sorted-path fallback guessed an
arbitrary owner). Fixed at both sides of the seam:
- Producer (otto/v5_clean_verify.py): new _py_compile_causal_paths
parses the actual failing filename(s) from py_compile stderr/stdout;
py_compile_failed.paths is now CAUSAL, not the broad input set. Audit:
py_compile was the ONLY clean-oracle producer with the
paths=command-input pattern; all others pass explicit/none.
- Router (otto/v5_runner.py): no first-sorted-path guess. The
contract-amendment write gate now supports MULTIPLE bound paths and
smoke-repair scheduling owns/binds ALL causal paths; if causal paths
are empty or cannot all be bound to the selected route → honest
integration_smoke_unrouteable terminal (never under-scoped, never
arbitrary-owner).
Net invariant: leaf-owned causal failure stays in-scope (scoped leaf
repair, unchanged); foundation/out-of-scope causal failure routes to the
correct owner with ALL causal paths bound; indeterminate →
honest-terminal; broad non-causal input paths never drive scope/routing.
+ real py_compile_failed multi-input regressions (leaf-owned causal →
in-scope; foundation causal → routed+bound; indeterminate →
unrouteable). The leaf regression directly exercises the d91cece bug
(old paths=rel_files fails the causal-path assertion before routing).
Verified: scenes #1/#2/#5 GREEN, #3/#4 RED; 44 S0-S2+S4 ownership units
GREEN; broad suite only the known pre-existing test_v5_phase2 + the
S5-RED scene #3 (no new regression); ruff clean. S0/S1/S2 + S4-R1
untouched.
Codex-fixed (Codex-found via Impl Gate R2); Claude-reviewed.
Tradeoff: broad compile inputs no longer kept as separate routing
evidence (still inspectable via the recorded oracle command).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
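To illustrate the producer/router split, here is a rough sketch that assumes the compiler output contains traceback-style `File "<path>", line N` lines. `causal_paths` and `classify` are hypothetical names (the real producer is `_py_compile_causal_paths`, whose parsing rules are not shown in this PR), and the routing is simplified to three outcomes.

```python
import re
from pathlib import PurePosixPath
from typing import List, Sequence

# Assumed output shape: '  File "pkg/mod.py", line 3'
_FILE_LINE = re.compile(r'File "([^"]+)", line \d+')


def causal_paths(compiler_output: str) -> List[str]:
    """Extract failing file names, in order of first appearance, deduplicated."""
    paths: List[str] = []
    for match in _FILE_LINE.finditer(compiler_output):
        if match.group(1) not in paths:
            paths.append(match.group(1))
    return paths


def classify(causal: Sequence[str], leaf_owned: Sequence[str]) -> str:
    """Route on causal paths only; an empty set is never guessed at."""
    if not causal:
        return "integration_smoke_unrouteable"
    owned = [PurePosixPath(o) for o in leaf_owned]

    def is_owned(path: str) -> bool:
        p = PurePosixPath(path)
        return any(p == o or o in p.parents for o in owned)

    return "in_scope" if all(is_owned(c) for c in causal) else "out_of_scope"
```

With the old behavior, every compiled file would have been passed to `classify`, so one foundation file in the input set was enough to push a leaf-owned syntax error out of scope.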
logpie added a commit that referenced this pull request on May 20, 2026
… decomposition)
Capstone run #1 evidence: root Lead decomposed CORRECTLY (5 children in
graph: foundation + 4 features, all submit_subtask duplicate:false) in
~237s, then the run died "run_budget_seconds exceeded during
root_decomposition (3000s)" at duration 642.2s, a 4.7x-early fire with
a false label.
Root cause (my P4 impl, not Codex's P1-P5 logic): both phase caps were
hard-set to 240s: `spec_compile` (v5_runner.py:4007) and
`root_decomposition` (:4059). 240s is far too tight: a real flat compile
of a 47-feature product is ~6-10min and decomposition ~4-5min. The Lead
emitted all 5 children at 237s and the 240s cap killed the phase ~3s
later as it returned. `_V5RunDeadlineExceeded` also always reported
`run_budget_seconds` (3000) regardless of which limit fired, masking the
real cause.
Fix: relax both phase caps 240 → 900s (generous hang-rails: healthy spec
~6min / decomp ~4min never killed; a genuine >15min hang still caught).
run_budget_seconds remains the true total ceiling (P4 intent preserved:
per-phase caps are hang detectors, the run budget is the real bound).
`_await_with_run_deadline`/`_V5RunDeadlineExceeded` now report the
ACTUAL fired timeout and whether it was a phase_cap vs run_budget, so
the diagnostics are honest and this can't mislabel again.
Verified: syntax OK; ruff clean; 12 P1-P5 tests GREEN (incl P4
wall-clock). Claude-fixed from real run-#1 logs (no Codex; out of
credit). basedpyright None.get@595-598 confirmed guarded (false
positive); remaining strict-typer flags in P1-P5 are non-runtime noise.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
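The labeling fix can be sketched as picking whichever limit is tighter and remembering which one it was. The names below (`await_with_deadline`, `DeadlineExceeded`) are illustrative, not the signatures of `_await_with_run_deadline` or `_V5RunDeadlineExceeded`, and the sketch assumes the awaited coroutine does not itself raise a timeout error.

```python
import asyncio
import time


class DeadlineExceeded(Exception):
    """Carries the limit that actually fired, not just the run budget."""

    def __init__(self, phase: str, kind: str, limit_s: float) -> None:
        super().__init__(f"{kind} exceeded during {phase} ({limit_s:g}s)")
        self.phase, self.kind, self.limit_s = phase, kind, limit_s


async def await_with_deadline(coro, *, phase: str, phase_cap_s: float,
                              run_started: float, run_budget_s: float):
    """Await `coro` under the tighter of the phase cap and the remaining budget."""
    remaining = run_budget_s - (time.monotonic() - run_started)
    if phase_cap_s <= remaining:
        kind, timeout, limit = "phase_cap", phase_cap_s, phase_cap_s
    else:
        kind, timeout, limit = "run_budget", max(remaining, 0.0), run_budget_s
    try:
        return await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        # Report the limit that fired so the message cannot mislabel the cause.
        raise DeadlineExceeded(phase, kind, limit) from None
```

Under this scheme the run-#1 failure would have surfaced as a phase cap of 240s firing during `root_decomposition` rather than as a 3000s budget overrun.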
logpie added a commit that referenced this pull request on May 20, 2026
Run #6 (mib6-001619) was the first run to traverse the full pipeline
(compile→decomp→scaffold→persist→4-feature concurrent
fan-out→integration). It exposed the next-layer instance of the
decomp-boundary class: 3 of 4 features merge_blocked at integration on a
conflict in `backend/tests/conftest.py`. The conflict packet showed
`base: ""`: every feature leaf independently CREATED its own conftest.py
(each needs test fixtures), and the conflict-repair agent timed out
(399s) trying to reconcile divergent creates of the same shared file.
lead.md's architect guidance isolated route/API/screen registration but
said nothing about shared TEST/BUILD infrastructure, which is the same
kind of shared registry: a file every feature would otherwise each
create or edit.
Adds a rule (general, not conftest-specific): the scaffold MUST create
shared test/build bootstrap (conftest.py, tests/setup.*, jest.config.*,
shared DB/session fixtures, shared mocks, shared lint/type config) and
list it in shared_registry_files with leaf_edit:false; feature leaves
add only their own test_<feature>.* modules under the extension globs
and import the shared harness. Divergent independent creates of these
files are the #1 integration merge-conflict cause.
Prompt-only root-cause fix (no new deterministic validator predicate;
that would re-introduce the brittle-predicate anti-pattern this campaign
is removing). Pairs with 104522a (policy-label first-try gate) for the
run #7 <45min test.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
- Adds a gate pilot that replaces `replan()` after batch failures with richer failure analysis, retry strategies, context routing, and skip recommendations
- Falls back to `replan()` on failure. Config flag `pilot: false` to disable. Zero overhead when no failures.
Status: NOT validated on real failures
The pilot never fired during benchmarking because the coding agent passed all tasks. This is expected — the pilot's value is at i2p scale (8+ tasks, multiple batches, partial failures). Shipping as a safe no-op upgrade.
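Because the pilot has not yet run against a real failure, the fallback path is what keeps it a safe no-op. A minimal sketch of that validate-or-fall-back step, assuming a hypothetical decision schema (`REQUIRED_KEYS` and both function names are illustrative, not taken from `otto/pilot.py`):

```python
import json
from typing import Callable, Optional

# Assumed schema: top-level keys a usable pilot decision must contain.
REQUIRED_KEYS = {"failure_analysis", "retry_strategies", "skip"}


def parse_pilot_decision(raw: Optional[str]) -> Optional[dict]:
    """Return the decision dict, or None when the output is unusable."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # not a string, or not valid JSON
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def decide(raw: Optional[str], replan: Callable[[], dict]) -> dict:
    """Use the pilot's decision when it validates; otherwise call replan()."""
    decision = parse_pilot_decision(raw)
    return decision if decision is not None else replan()
```

Any output that fails to parse or validate takes the same path as a disabled pilot, so the worst case is the pre-pilot behavior.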
What's new
- otto/pilot.py
- otto/orchestrator.py
- otto/runner.py
- tests/test_pilot.py
- tests/test_pilot_benchmark.py
- bench/pilot-benchmark.sh
- bench/pressure/projects/pilot-test-*
Design docs
- docs/superpowers/specs/2026-03-29-gate-pilot.md
- docs/superpowers/plans/2026-03-29-gate-pilot-stage1.md
- docs/superpowers/specs/2026-03-26-otto-intent-to-product.md
Test plan
- `pilot.log` on next failure.
🤖 Generated with Claude Code